The Data Preparation phase of the IBM's CRISP model is about preparing the final dataset to be used in the later stages. In this phase, the table, record, and attribute selection, as well as transformation and cleaning are considered. Practice shows that these tasks are likely to be performed multiple times.
import pandas as pd
import os
import numpy as np
from os import listdir
import matplotlib.pyplot as plt
from PIL import Image as PImage
data = pd.read_excel ('./Flowers-new.xlsx')
path = './Eindhoven_Flower_DataSet_2021/'
data.head()
| date | landscape | location | photo_id | english_name | unit | other_flower | |
|---|---|---|---|---|---|---|---|
| 0 | 2 April, 2021 | Woodland | Helmond | Warandabos | _B4A2382 | Wood Anemone | Single flower | NaN |
| 1 | NaN | Woodland | Helmond | Warandabos | _B4A2389 | Wood Anemone | Single flower | NaN |
| 2 | NaN | Woodland | Helmond | Warandabos | _B4A2399 | Wood Anemone | Single flower | NaN |
| 3 | NaN | Woodland | Helmond | Warandabos | _B4A2402 | Wood Anemone | Single flower | NaN |
| 4 | NaN | Woodland | Helmond | Warandabos | _B4A2403 | Wood Anemone | Single flower | NaN |
data['english_name'].unique()
array(['Wood Anemone', 'Red Dead-nettle ', 'Red Dead-nettle',
'Lesser Celandine', 'Common Dandelion', 'Lesser Celandine\xa0',
'Daisy', 'Chickweed', 'Marsh-marigold ', 'Marsh-marigold',
'Cuckooflower ', 'Cuckooflower', 'Butterbur', "Miner's Lettuce",
"Miner's Lettuce ", 'Cow Parsley', 'Thale Cress', 'Common Nettle',
'Common Water-crowfoot ', 'Green Alkanet', 'White Nettle',
'Bittercress', 'Redstem Filaree\xa0', 'Field forgot-me-not',
"Dove's-foot Crane's Bill", 'Creeping Buttercup', 'Marsh Valerian',
'Ribwort Plantain', 'Bugle ', 'Marsh Thistle', 'Yellow Flag',
'Bluebell', 'Red Clover', 'Sorrel', 'Ox-eye Daisy ',
'Ragged-robin', 'Turnip Rape', 'Common Vetch', 'Greater Celandine',
'Flatweed', 'Groundsel', 'White Clover', 'Marsh Lousewort',
'Comfrey', "Shepherd's Purse", 'Field pansy', 'Lesser Trefoil',
'Field Mouse-ear', 'Meadow Buttercup', 'Common Lousewort',
'True Forget-me-not ', 'Hidcote Pink', 'Broad-leaved Marsh Orchid',
'Common Twayblade', 'Tormentil', 'Meadow Thistle',
'Germander Speedwell', 'Ground-ivy', 'Common Poppy',
'Lesser Trefoil ', 'Southern Marsh Orchid', 'Bistort',
'Bitter Dock', 'Ox-eye Daisy', 'Chamomile', 'Cornflower',
"Little White Bird's-foot", 'Red Campion', 'Ground Elder',
'Goldmoss Stonecrop', 'Narrow-leaved Rattle', 'Hairy Tare',
'Hoary Alyssum', 'Heath Spotted Orchid', 'Tufted Loosestrife',
'Bog Stitchwort', 'Yarrow', "Bird's-foot Trefoil ", 'Tufted Vetch',
'Silverweed', 'White Campion', "Viper's Bugloss",
'Lesser Butterfly Orchid', 'Purple Marshlocks',
"Perforate St John's-wort", 'Brown Knapweed', 'Ragwort',
"Bird's-foot Trefoil", "Hare's-foot clover ",
'Large-flowered Evening-primrose', "Hare's-foot Clover",
"Sheep's-bit", 'Bladder Campion', 'Kidney Vetch', 'Common Bugloss',
'Hungarian Mullein', 'Golden Marguerite', 'Field Bindweed',
'Smooth Hawksbeard', 'White Waterlily', 'Broad-leaved Helleborine',
'Spear Thistle', 'Greater Musk-mallow', 'Maiden Pink',
'Woodland Sage', 'Corn Marigold', 'Meadowsweet',
'Cross-leaved Heath', 'Purple Loosestrife', 'Hedge Bindweed',
'Yellow Loosestrife', 'Hemp-agrimony', 'Creeping Thistle', 'Tansy',
'Valerian', 'Wild Carrot', 'Sneezewort', 'Longroot Smartweed ',
'Gypsywort', 'Lesser Spearwort', "Lady's Thumb", 'Parsnip',
'Hogweed', 'Common Hemp-nettle', 'Water Mint',
'European Water-plantain', 'Broadleaf Plantain', 'Black Medick',
'California Poppy'], dtype=object)
len(data['english_name'].unique())
129
In the previous two steps, we noticed that some of the unique names are repeating due to the whitespaces at the end of a few cells. Furthermore, an example can be observed in the first row. Thus, we had to remove each empty space at the end of each cell. The code for this is shown in the next line.
data['english_name']=data['english_name'].str.replace(' +$', '')
<ipython-input-7-c266056b52e1>:1: FutureWarning: The default value of regex will change from True to False in a future version.
data['english_name']=data['english_name'].str.replace(' +$', '')
# Number of unique names after removing all whitespaces at the end.
len(data['english_name'].unique())
122
data['english_name'].unique()
array(['Wood Anemone', 'Red Dead-nettle', 'Lesser Celandine',
'Common Dandelion', 'Lesser Celandine\xa0', 'Daisy', 'Chickweed',
'Marsh-marigold', 'Cuckooflower', 'Butterbur', "Miner's Lettuce",
'Cow Parsley', 'Thale Cress', 'Common Nettle',
'Common Water-crowfoot', 'Green Alkanet', 'White Nettle',
'Bittercress', 'Redstem Filaree\xa0', 'Field forgot-me-not',
"Dove's-foot Crane's Bill", 'Creeping Buttercup', 'Marsh Valerian',
'Ribwort Plantain', 'Bugle', 'Marsh Thistle', 'Yellow Flag',
'Bluebell', 'Red Clover', 'Sorrel', 'Ox-eye Daisy', 'Ragged-robin',
'Turnip Rape', 'Common Vetch', 'Greater Celandine', 'Flatweed',
'Groundsel', 'White Clover', 'Marsh Lousewort', 'Comfrey',
"Shepherd's Purse", 'Field pansy', 'Lesser Trefoil',
'Field Mouse-ear', 'Meadow Buttercup', 'Common Lousewort',
'True Forget-me-not', 'Hidcote Pink', 'Broad-leaved Marsh Orchid',
'Common Twayblade', 'Tormentil', 'Meadow Thistle',
'Germander Speedwell', 'Ground-ivy', 'Common Poppy',
'Southern Marsh Orchid', 'Bistort', 'Bitter Dock', 'Chamomile',
'Cornflower', "Little White Bird's-foot", 'Red Campion',
'Ground Elder', 'Goldmoss Stonecrop', 'Narrow-leaved Rattle',
'Hairy Tare', 'Hoary Alyssum', 'Heath Spotted Orchid',
'Tufted Loosestrife', 'Bog Stitchwort', 'Yarrow',
"Bird's-foot Trefoil", 'Tufted Vetch', 'Silverweed',
'White Campion', "Viper's Bugloss", 'Lesser Butterfly Orchid',
'Purple Marshlocks', "Perforate St John's-wort", 'Brown Knapweed',
'Ragwort', "Hare's-foot clover", 'Large-flowered Evening-primrose',
"Hare's-foot Clover", "Sheep's-bit", 'Bladder Campion',
'Kidney Vetch', 'Common Bugloss', 'Hungarian Mullein',
'Golden Marguerite', 'Field Bindweed', 'Smooth Hawksbeard',
'White Waterlily', 'Broad-leaved Helleborine', 'Spear Thistle',
'Greater Musk-mallow', 'Maiden Pink', 'Woodland Sage',
'Corn Marigold', 'Meadowsweet', 'Cross-leaved Heath',
'Purple Loosestrife', 'Hedge Bindweed', 'Yellow Loosestrife',
'Hemp-agrimony', 'Creeping Thistle', 'Tansy', 'Valerian',
'Wild Carrot', 'Sneezewort', 'Longroot Smartweed', 'Gypsywort',
'Lesser Spearwort', "Lady's Thumb", 'Parsnip', 'Hogweed',
'Common Hemp-nettle', 'Water Mint', 'European Water-plantain',
'Broadleaf Plantain', 'Black Medick', 'California Poppy'],
dtype=object)
From the EDA that we have made, we have plotted all types of units and we noticed that two typing mistakes were made. Therefore, we are replacing the single values with the same name as the unit with multiple values. ("spike - Spike; "Umrel - Umbrel").
data['unit'] = data['unit'].str.replace('spike','Spike')
data['unit'] = data['unit'].str.replace('Umrel','Umbrel')
data['unit'] = data['unit'].str.replace('Single Flower','Single flower')
Since the column "location" contains the city and extra information regarding where the photo was taken such as district, we have splitted the values of the column into two different columns - "location(the city)" and "district".
data['district'] = data['location'].str.split('|').str[1]
data['location'] = data['location'].str.split('|').str[0]
data.head(10)
| date | landscape | location | photo_id | english_name | unit | other_flower | district | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 April, 2021 | Woodland | Helmond | _B4A2382 | Wood Anemone | Single flower | NaN | Warandabos |
| 1 | NaN | Woodland | Helmond | _B4A2389 | Wood Anemone | Single flower | NaN | Warandabos |
| 2 | NaN | Woodland | Helmond | _B4A2399 | Wood Anemone | Single flower | NaN | Warandabos |
| 3 | NaN | Woodland | Helmond | _B4A2402 | Wood Anemone | Single flower | NaN | Warandabos |
| 4 | NaN | Woodland | Helmond | _B4A2403 | Wood Anemone | Single flower | NaN | Warandabos |
| 5 | NaN | Woodland | Helmond | _B4A2404 | Wood Anemone | Single flower | NaN | Warandabos |
| 6 | NaN | Woodland | Helmond | _B4A2405 | Wood Anemone | Single flower | NaN | Warandabos |
| 7 | NaN | Urban | Helmond | _B4A2406 | Red Dead-nettle | Spike | NaN | industrieterrein |
| 8 | NaN | Urban | Helmond | _B4A2407 | Red Dead-nettle | Spike | NaN | industrieterrein |
| 9 | NaN | Urban | Helmond | _B4A2408 | Red Dead-nettle | Spike | NaN | industrieterrein |
Since the date is only filled for the first row per visit, the following code fills in the NaN values of the rows with missing data. The funcition used is ‘ffill’ which stands for ‘forward fill’ and replaces the missing values with the corresponding value in the previous row.
data['date'] = data['date'].ffill()
data.head(10)
| date | landscape | location | photo_id | english_name | unit | other_flower | district | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 April, 2021 | Woodland | Helmond | _B4A2382 | Wood Anemone | Single flower | NaN | Warandabos |
| 1 | 2 April, 2021 | Woodland | Helmond | _B4A2389 | Wood Anemone | Single flower | NaN | Warandabos |
| 2 | 2 April, 2021 | Woodland | Helmond | _B4A2399 | Wood Anemone | Single flower | NaN | Warandabos |
| 3 | 2 April, 2021 | Woodland | Helmond | _B4A2402 | Wood Anemone | Single flower | NaN | Warandabos |
| 4 | 2 April, 2021 | Woodland | Helmond | _B4A2403 | Wood Anemone | Single flower | NaN | Warandabos |
| 5 | 2 April, 2021 | Woodland | Helmond | _B4A2404 | Wood Anemone | Single flower | NaN | Warandabos |
| 6 | 2 April, 2021 | Woodland | Helmond | _B4A2405 | Wood Anemone | Single flower | NaN | Warandabos |
| 7 | 2 April, 2021 | Urban | Helmond | _B4A2406 | Red Dead-nettle | Spike | NaN | industrieterrein |
| 8 | 2 April, 2021 | Urban | Helmond | _B4A2407 | Red Dead-nettle | Spike | NaN | industrieterrein |
| 9 | 2 April, 2021 | Urban | Helmond | _B4A2408 | Red Dead-nettle | Spike | NaN | industrieterrein |
# converting the date object format to datetime format
data['date'] = pd.to_datetime(data.date)
data['month'] = data['date'].dt.strftime('%B')
# rearranging column positions
column_names = 'date', 'month', 'landscape', 'location', 'district', 'photo_id', 'english_name', 'unit', 'other_flower'
data = data.reindex(columns = column_names)
data.head(10)
| date | month | landscape | location | district | photo_id | english_name | unit | other_flower | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2382 | Wood Anemone | Single flower | NaN |
| 1 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2389 | Wood Anemone | Single flower | NaN |
| 2 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2399 | Wood Anemone | Single flower | NaN |
| 3 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2402 | Wood Anemone | Single flower | NaN |
| 4 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2403 | Wood Anemone | Single flower | NaN |
| 5 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2404 | Wood Anemone | Single flower | NaN |
| 6 | 2021-04-02 | April | Woodland | Helmond | Warandabos | _B4A2405 | Wood Anemone | Single flower | NaN |
| 7 | 2021-04-02 | April | Urban | Helmond | industrieterrein | _B4A2406 | Red Dead-nettle | Spike | NaN |
| 8 | 2021-04-02 | April | Urban | Helmond | industrieterrein | _B4A2407 | Red Dead-nettle | Spike | NaN |
| 9 | 2021-04-02 | April | Urban | Helmond | industrieterrein | _B4A2408 | Red Dead-nettle | Spike | NaN |
The code below is used to label the images according to the english name of the flower. Additionally, there is a function which concatenates the name of the flower to the name of the image. Also, the pictures which are not present in the excel file are stored in a different folder called "Extra images". Furthermore, a csv file with the duplicated images from the dataset (2 photos having the same photo_id) is created. Also, we have added screenshots of how the end result looks like.
# taking the photo ids
def get_photo_ids(data):
# initializing an empty array
photo_ids = []
# adding all ids to the array
for i in range(len(data)):
col_name = 'photo_id'
col_index = data.columns.get_loc(col_name)
current_id = data.iloc[i, col_index]
# adding .JPG to the photo id so it matches with the id of the photo
photo_ids.append(str(current_id) + '.JPG')
return photo_ids
# taking the english name of each flower
def get_eng_names(data):
# initializing an empty array
names = []
# adding all names to the array
for i in range(len(data)):
col_name = 'english_name'
col_index = data.columns.get_loc(col_name)
current_name = data.iloc[i, col_index]
names.append(current_name)
return names
def get_unique_names(data):
# getting the english names
eng_names = get_eng_names(data)
# initialize an empty array for the unique values
unique_names = []
# traverse for all elements
for x in eng_names:
# check if exists in unique_names or not
if x not in unique_names:
unique_names.append(x)
return unique_names
def to_one_folder(path):
# loading the images
images = listdir(path)
# looping through all three dates
for i in range(len(images)):
new_path = os.path.join(path, images[i])
new_images = listdir(new_path)
# looping though each folder for each date
for j in range(len(new_images)):
current_path = os.path.join(new_path, new_images[j])
current_images = listdir(current_path)
# moving the images to the parent directory and deleting the previous two
for img in current_images:
os.rename(current_path + '/' + img, path + img)
os.rmdir(current_path)
os.rmdir(new_path)
to_one_folder(path)
def check_data_for_missing_photos(path, data):
# checking tha data for duplicating ids (the original and the copy)
if data['photo_id'].duplicated().sum() > 0:
# saving the duplicated entries so they can be examinied
duplicates = data[data['photo_id'].duplicated(keep=False)]
duplicates.to_csv('./duplicated_photo_ids.csv')
# Dropping the duplicated entries
entries_to_drop = data[data['photo_id'].duplicated(keep='first')].index
data.drop(entries_to_drop, axis=0, inplace=True)
print('Not all ids were unique. Only the first instances were kept. Find the duplicated ids by loading duplicated_photo_ids.csv\n')
#raise Exception('Not all ids are unique. See the duplicated ids by loading duplicated_photo_ids.csv.')
photo_ids = get_photo_ids(data)
extra_images = []
images = listdir(path)
for img in images:
if img not in photo_ids:
extra_images.append(img)
if len(extra_images) == 0:
print('There are no extra photos\n')
else:
# Drop the extra images and add them in separate folder
destination = './Extra images/'
isDir = os.path.isdir(destination)
if not isDir:
os.mkdir(destination)
for i in extra_images:
for j in images:
if i == j:
os.rename(path + i, destination + i)
else:
continue
print(f'There were {len(extra_images)} extra photos that are not in the dataset. They have been moved to data/Extra images. \n')
return extra_images
extra_images = check_data_for_missing_photos(path, data)
Not all ids were unique. Only the first instances were kept. Find the duplicated ids by loading duplicated_photo_ids.csv There were 19 extra photos that are not in the dataset. They have been moved to data/Extra images.
duplicated_photos = pd.read_csv('./duplicated_photo_ids.csv')
duplicated_photos
| Unnamed: 0 | date | month | landscape | location | district | photo_id | english_name | unit | other_flower | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77 | 2021-04-23 | April | NaN | Nederwetten | Dommeldal | _B4A3639 | Cuckooflower | Raceme | NaN |
| 1 | 83 | 2021-04-26 | April | NaN | Eindhoven | Bokt | _B4A3639 | Common Water-crowfoot | Single flower | NaN |
| 2 | 496 | 2021-06-09 | June | Roadside | Nuenen-Eindhoven | Europalaan & Sterrenlaan | _B4A7579 | Yarrow | Umbrel | NaN |
| 3 | 497 | 2021-06-09 | June | Roadside | Nuenen-Eindhoven | Europalaan & Sterrenlaan | _B4A7579 | Yarrow | Umbrel | Rode Klaver |
| 4 | 751 | 2021-07-27 | July | Nederwetten | Dommeldal | Pool (LT8) | NaN | _B4A9763 | Water Mint | Head | Grote Wederik (out-of-focus) |
| 5 | 752 | 2021-07-27 | July | Nederwetten | Dommeldal | Pool (LT8) | NaN | _B4A9763 | Gypsywort | Spike | Grote Waterweegbree, Watermunt (not blooming) |
# Function that renames the images by adding their english name to the photo id
def rename_images(path, data):
# loading the images
images = listdir(path)
# taking all photo ids
photo_ids = get_photo_ids(data)
# taking the english name of each flower
names = get_eng_names(data)
new_captions = []
for i in range(len(images)):
for j in range(len(photo_ids)):
if images[i] == photo_ids[j]:
caption = names[j] + images[i]
original_path = os.path.join(path, images[i])
new_path = os.path.join(path, caption)
os.replace(original_path, new_path)
rename_images(path, data)
def label_images(path, data):
# load the images
images = listdir(path)
# get the unique names of the flowers
u_names= get_unique_names(data)
# taking all photo ids
photo_ids = get_photo_ids(data)
# create a directory for each name and move the corresponding pictures to the new directories
for u_n in u_names:
for img in images:
for ids in photo_ids:
if u_n + ids == img:
destination = os.path.join(path, u_n)
isDir = os.path.isdir(destination)
if isDir:
# add / to the unique names to they can be used as directories
os.rename(path + img, destination + '/' + img)
else:
os.mkdir(destination)
os.rename(path + img, destination + '/' + img)
print('Labeling has finished.')
label_images(path, data)
Labeling has finished.
Overall, the excel file needed some pre-processing due to the mulitple typos. Moreover, the client had whitespaces in some of the column names as well as in some of the rows. Also, the excel file was not that well structured - around the table, there were different headings and paragraphs which are irrelevant for our AI model. Furthermore, the way in which the images were stored was a bit confusing and impractical for people who are not familiar with this field.
After the Data Preparation phase was completed, we have a structured environment with folders of flowers per flower name and a standardized excel file.